An efficient CRISPR-Cas9 enrichment sequencing strategy for characterizing complex and highly duplicated genomic regions. A case study in the Prunus salicina LG3-MYB10 genes cluster

Background Genome complexity is largely linked to diversification and crop innovation. Examples of regions with duplicated genes with relevant roles in agricultural traits are found in many crops. In both duplicated and non-duplicated genes, much of the variability in agronomic traits is caused by large as well as small and middle scale structural variants (SVs), which highlights the relevance of the identification and characterization of complex variability between genomes for plant breeding. Results Here we improve and demonstrate the use of CRISPR-Cas9 enrichment combined with long-read sequencing technology to resolve the MYB10 region in the linkage group 3 (LG3) of Japanese plum (Prunus salicina). This region, which has a length from 90 to 271 kb according to the P. salicina genomes available, is associated with fruit color variability in Prunus species. We demonstrate the high complexity of this region, with homology levels between Japanese plum varieties comparable to those between Prunus species. We cleaved MYB10 genes in five plum varieties using the Cas9 enzyme guided by a pool of crRNAs. The barcoded fragments were then pooled and sequenced in a single MinION Oxford Nanopore Technologies (ONT) run, yielding 194 Mb of sequence. The enrichment was confirmed by aligning the long reads to the plum reference genomes, with a mean read on-target value of 4.5% and a depth per sample of 11.9x. From the alignment, 3261 SNPs and 287 SVs were called and phased. A de novo assembly was constructed for each variety, which also allowed detection, at the haplotype level, of the variability in this region. Conclusions CRISPR-Cas9 enrichment is a versatile and powerful tool for long-read targeted sequencing even on highly duplicated and/or polymorphic genomic regions, being especially useful when a reference genome is not available. Potential uses of this methodology as well as its limitations are further discussed. Supplementary Information The online version contains supplementary material available at 10.1186/s13007-022-00937-4.

to WGD, small-scale duplications (SSD), usually produced by segmental duplications and gene expansion, also play an important role in crop innovation [5,6]. Reduced selective pressure acting on duplicated genes may, in some cases, result in them acquiring novel functionalities that contribute to diversification and lead to important traits for agriculture [7][8][9][10].
Gene diversification results in high levels of diversity within gene families. For example, a substantial number of copies of celiac disease related α-gliadin genes are estimated in wheat (ranging from 25-35 to 100-150 copies per haploid genome), and are probably generated by duplication, deletion events and retrotransposon insertion [11,12]. These genes show high variability between genomes [13], with at least half of the copies being inactive or pseudogenes [14].
In both duplicated and non-duplicated genes, a large fraction of the variability is caused by large as well as small and middle scale structural variants (SVs). For example, Chawla et al. [15] found that up to 10% of all genes in Brassica napus were affected by small-to midscale SV events. In tomato, Alonge et al. [16] found multiple SVs, in many cases mediated by transposable elements (TEs); many of the SVs responsible for changes in gene dosage and expression levels modified agronomic traits such as fruit flavor, size, and production. Recent studies of SVs between and within plant species have revealed extensive genome content variation [17], leading to the pan-genome concept, which is centered on the study of the entire gene repertoire of a species by sequencing multiple individuals [18].
The major effect of duplicated genes and SVs, especially in crop agronomic traits, highlights the importance of studying the variability between genomes in these genes for plant breeding.
Whole-genome sequencing (WGS) methods have contributed enormously to broadening our understanding of the genetic basis of useful traits. However, genome complexity still represents a challenge for genome assembly and annotation. In general, large divergent duplications with less than 97% homology can be resolved by WGS assembly, whereas reads of duplications with higher homology are frequently collapsed, producing assembly errors [19].
A large number of genomes have been obtained, and SNPs and small InDels discovered using short-read high throughput sequencing (SR-HTS), but long-read high throughput sequencing (LR-HTS) is required to resolve highly homologous duplicated regions and SVs, including repetitive regions and different forms of copy number variations, such as presence-absence variants [20]. Multiple reference genomes for crop species are currently being generated using LR-HTS methods [16,21,22]; however, although the cost of long-range sequencing methods is decreasing, the availability of large amounts of high quality DNA, whole genome data storage, alignment and computation costs may be drawbacks when the objective is to resolve and explore the variability in a large panel of varieties of a unique complex loci for an interesting agricultural trait.
Over the last few years, several skimming and targetedenrichment sequencing methods have been released and successfully used to pinpoint small variants in single regions, especially in phylogenomics. However, a costeffective method to scan the variability (including SV) in a complex or highly diverse region in a panel of genotypes (for example varieties or seedlings) using LR-HTS strategies is still lacking.
Recently, a methodology has been reported for longread targeted sequencing using CRISPR-Cas9 technology to direct the sequencing adapters to the regions of interest [23,24]. In this procedure, selected regions of high molecular weight dephosphorylated DNA are cut by the Cas9 enzyme directed by specific guide RNAs (gRNAs). The digested fragments are enriched by long-read sequencing thanks to the preferential ligation of sequencing adapters towards the 5' phosphate groups generated at the cleaved ends. Gilpatrick et al. [23] used a pool of gRNAs to cut and sequence several loci associated with human breast cancer on the same run, resolving SNPs and SVs at the haplotype level. This methodology is simple to perform, as it does not require physical separation of the DNA for sequencing, cloning or amplification steps, thus allowing native DNA strands to be read, and it is not affected by amplification bias and allows visualization of allele-specific methylation patterns [23,25]. CRISPR-Cas9 enrichment has recently been used on a plant genome (as a proof of concept) showing that, at the haplotype level, it can resolve the SV that causes red flesh color in apples [24]. Although this methodology was used to sequence a short (7.8 kb) and low complex region in one apple variety, the successful results augur well for its use for fine-mapping in other situations.
The objective of this study was to evaluate the potential of CRISPR-Cas9 enrichment to identify the variability (in particular SV) in highly complex regions using a pool of DNA of highly diverse varieties. The region selected was the Japanese plum (Prunus salicina) MYB10 region located on linkage group 3 (LG3), found recently to contain at least three MYB10.1 (MYB10.1a, MYB10.1b and MYB10.c) genes, one MYB10.2 gene and one MYB10.3 gene, while additional copies could not be discarded [26]. A marker system designed in this region was able to characterize the LG3-PsMYB10 allelic organization in Japanese plum varieties and progenies into six haplotypes (H1 to H6), each with a different combination of the MYB10.1-3 alleles [26]. For example, H1 and H3 haplotypes contained three PsMYB10.1 alleles, one of them (MYB10.1a-a356) associated with the anthocyanin skin color; polymorphisms in the promoter and intron of this allele in H1 and H3 were observed, confirming the high variability of this region.
We designed seven gRNAs in conserved sites of the PsMYB10 genes and pooled them, together with two guides flanking the region, to allow for multiple Cas9mediated cuts in the DNA of five Japanese plum commercial varieties, all with a distinct and heterozygous genotype for the MYB10 region, while sharing one haplotype by pairs. The fragments were labeled with a barcode system and pooled and sequenced in a single MinION Oxford Nanopore Technology (ONT) run. Reads alignment with reference genomes or de novo alignments successfully identified SNPs and SVs, most of them newly discovered and others observed in our previous work, so validating the methodology for sequencing and variant detection of regions containing complex duplications.

The Japanese plum LG3-MYB10 region is highly complex
We used the Japanese plum LG3-MYB10 region as a model of high complexity. Japanese plum is a self-incompatible fruit tree with a complex history of interspecific crosses between diploid plums: the LG3-MYB10 region in Prunus genomes, particularly in Japanese plum, is highly diverse and contains a cluster of at least three paralogous MYB10 gene copies.
In the 'Zhongli No. 6' genome we identified two MYB10 regions 2 Mb apart in chromosome 3: one 271 kb long (from now on, Zhongli-1; LG03: 30,838,109,359) and the other 90 kb long (Zhongli-2; LG03:28,592,935-28,682,529). The Zhongli-1 region had five MYB10 genes annotated, corresponding to two copies of MYB10.1, two copies of MYB10.2 and one copy of a MYB10.3 gene. These genes were mis-annotated: the second exon was omitted in all but one of the PsMYB10.1 copies, while in this copy the STOP codon in the last exon was not detected, causing the spurious automatic identification of five extra exons along the next 6 kb of sequence. By BLAST we identified three additional genes: one was homologous to MYB10.1 and two homologous to MYB10.2. In the Zhongli-2 region, four MYB10 genes were annotated: two MYB10.1 copies, one MYB10.2 and one MYB10.3. The second exon was also mis-annotated in most of the gene copies of this region. BLAST analysis did not identify additional MYB10 genes.
To identify and visualize the differences in size and gene copy number and organization, we produced dot plots comparing the three LG3-MYB10 region assemblies (Fig. 1). These reflected the high variability and poor homology along the region, with inverted assemblies, mis-assemblies and missing fragments. The Zhongli-1 MYB10 region was assembled in an inverted orientation, when compared with 'Sanyueli' and Zhongli-2. Despite this, the Zhongli-1 (of 271 kb) and the Zhongli-2 (90 kb) alignments were collinear and with high homology (69% of sequence hits), with an evident gap of ~ 150 kb due to a missing fragment in Zhongli-2 explaining most of their difference in size (271 kb vs 90 kb). The percentage of hits between 'Sanyueli' and Zhongli-1 was lower (51%), with multiple misalignments along the region that explained the smaller size in 'Sanyueli' . Similarly, a discontinuous alignment and an inversion was also observed between Zhongli-2 and 'Sanyueli' (51% of sequence hits).
Similar comparisons were constructed for other Prunus species with more than one available reference genome (Additional Files 1, 2). The MYB10 regions of the two sweet cherry (P. avium) genomes were highly collinear with 75% of sequence hits. In peach (P. persica), the region was collinear and highly similar in the 'Lovell' and 'Chinese Cling' genomes (85% of sequence hits), and three times larger than in the wild relative P. mira, with a mean of 55.5% of sequence hits. In apricot, we compared five P. armeniaca accessions and three wild apricot species (two P. sibirica and one P. mandshurica). Homology between each P. armeniaca pairwise comparison ranged from 44 to 91% (mean of 67.7% of hits). When these were compared to the wild apricot genomes, the homology ranged from 58 to 93% (mean of 74.8% of hits) with P. sibirica, and from 47 to 59% (mean of 53.2% of hits) with the P. mandshurica genome. The mean percentage of sequence hits between the Prunus sections was 19.4% with values ranging from 8.6% to 37%.

The designed gRNAs successfully cut LG3-MYB10 genes
For selective high-throughput sequencing of the desired LG3-MYB10 region in Japanese plum varieties, we designed seven CRISPR RNAs (crRNAs) in genic regions (exons and introns) of the PsMYB10.1, PsMYB10.2 and PsMYB10.3 genes, including regions conserved between the genes to minimize the number of required crRNAs.
The crRNAs were designed at the two DNA strands to ensure forward and reverse sequences and to direct the sequencing from the inside to the outside of the gene, covering the whole region with concatenated sequences (Fig. 2). The PsMYB10.1, PsMYB10.2 and PsMYB10.3 sequences obtained in Fiol et al. (2021), the peach 'Lovell' reference genome, and other Prunus genomes were used to design the crRNAs with higher on-target capacity and lower off-target risk (see Methods). Therefore, crRNA-s1 was designed to cut within the exon 2 of MYB10.1 and MYB10.2 genes in the plus strand, while crRNA-s3 was directed to the same exon and strand in MYB10.3 (Additional File 3). To avoid a putative mis-cut due to a possible SNP in the second exon of the MYB10.2 cleavage site of crRNA-s1 (although only identified in P. cerasifera, P. domestica and P. avium) we designed crRNA-s2 for the alternative nucleotide. As reported by Fiol et al., variability in the introns was higher than in exons, so the crRNAs were designed taking into consideration the variants known. Introns of PsMYB10.1, PsMYB10.2 and PsMYB10.3 were targeted with crRNA-a1, crRNA-a3 and crRNA-a4, respectively, while crRNA-a2 was designed to include an SNP variant in one of the PsMYB10.1 alleles present in the sample (PsMYB10.1-H1 in [26]). In addition, to cut the MYB10 genomic region at each flank, two crRNAs (crRNA-f1 and crRNA-f2) were designed in   Guide-RNAs assembled with the trans-activating CRISPR RNA (tracrRNA) and crRNA-s (s1, s2 or s3) cut the exon 2 of the gene, gRNAs with crRNA-a (a1, a2, a3 or a4) cut by intron 1. Fragment 6 was not obtained after cleavage, fragment 3 and 5 bands overlapped in the gel to the preferential amplification of the shorter one). The fragments were cut and visualized in five bands. Three of them corresponded to the fragments 2 (526 bp), 3 (1078 bp), 4 (613 bp) and 5 (991 bp) in the in-silico design shown in Fig. 3b, with fragments 3 and 5 overlapping in the gel. The two faint additional bands corresponded also to fragments 3 and 5 (of about 2.5 kb) but of the allele with the large intron 2, and to an undigested fraction of the product of the preferentially amplified copy. Similar results were observed for the MYB10.2 amplicon, which was cut into four fragments. The band corresponding to the small gene fragment (fragment 6 in Fig. 3b) spanning the exon and intron crRNA sites was not produced in either the MYB10.1 or the MYB10.2 cleavages. The pool was not able to cut the MYB10.3 amplicon. A posterior sequencing of that amplicon revealed an unexpected 210bp deletion affecting part of intron 1 and all of exon 2, and so lacking the two scission target sites.

Sequencing, alignment to the reference genomes and variant calling
The genomic DNA of five plum varieties (Table 1) was cleaved with the pool of gRNAs and barcoded. A pool of the output reactions was sequenced with the ONT Min-ION device, which produced 194 Mb of sequences in 33,794 reads of average mean length of 6,097 bp (Table 1). After demultiplexing the reads into varieties through their barcode, we obtained sequence yields ranging from 29.32 Mb to 45.64 Mb in reads of mean length from 4,110 to 7,778 bp; the average N50 was 14,510 bp.
The reads of each of the five varieties were aligned against the 'Sanyueli' and 'Zhongli No. 6' genomes. The coverage and depth of the alignments within the MYB10 region varied for each sample and reference genome used ( Table 2, Additional File 4). Higher coverage was observed for the alignments against the shorter region of Zhongli-2 (90 kb long, 63.4% of coverage in average) while the coverage of the alignments against Zhongli-1 (54.2% of coverage, 271 kb long) was only slightly lower than that for 'Sanyueli' (57% of coverage, 135 kb long). The on-target alignment did not correlate with the number of reads yielded, which was uneven for each variety. In particular, 'Golden Japan' was the variety with the lowest number of on-target aligned reads (168, 67 and 45 in 'Sanyueli' , Zhongli-1 and Zhongli-2, respectively). The mean sequence depth was also unequal for each  variety-reference genome combination. In general, for all varieties, a higher depth and higher number of on-target aligned reads was observed with 'Sanyueli' ( Table 2). The variety with the lowest depth was 'Golden Japan' due to the poor alignment, while 'TC Sun' performed better. The off-target regions in 'Golden Japan' corresponded mostly to small regions (< 1 kb) scattered all over the genome, homologous to the chloroplast DNA sequence and aligning reads clipped from both sides. The same off-target reads were observed aligned to these regions in the other sample sequences, indicating that it was not specific of 'Golden Japan' and the low-depth in its MYB10 region was caused by a poor Cas9 digestion.
In general, the sequence depth was unevenly distributed along the MYB10 region, with scattered regions of high depth (Additional File 4). This was caused by sequences that aligned partially and were clipped from one or both sides, indicating low homology with the genomes used as reference. Sequence clipping was not observed in 'TC Sun' aligned to Zhongli-1.
Variants were called for the alignment with 'Sanyueli' , the one with the highest depth. A total of 3,261 SNPs in the MYB10 region were found, 62.34% of them (2,033) being present in at least two varieties (Fig. 4a), 98.5% with the same alternative allele (Additional File 5). All six observed MYB10 haplotypes (H1-H6) defined by Fiol et al. (2021) plus one inferred (H9) were present in the five varieties, three of them shared by pairs (H1 by ' Angeleno' and 'Black Gold'; H3 by ' Angeleno' and 'Fortune'; and H4 by 'TC Sun' and 'Golden Japan'). This was used to assign the phase of some of the polymorphisms: ' Angeleno' and 'Black Gold' had 259 private SNPs which may correspond to the shared H1; ' Angeleno' and 'Fortune' had 62 that could correspond to the shared H3. 'Fortune' and 'Black Gold' had 87 private SNPs which should be in H2 and H6. Unfortunately, due to the poor quality of the 'Golden Japan' sequencing the phase of the SNPs in H4, H5 and H9 could not be inferred.
The accuracy of the pipeline was tested by searching a previously Sanger-validated insertion of 44 bp in the PsMYB10.1a promoter in the haplotypes associated with red skin color (H1, H3) and in H9. Sniffles detected the polymorphism in homozygosity on ' Angeleno' (H1/H3) and in heterozygosity on 'Black Gold' (H1/H2); NanoVar could detect it even in low depth on 'Golden Japan' (H4/H9). The insertion was not called in 'Fortune' (H3/H6), although some reads were found in low frequency.

de novo assembly and homology visualization
Despite being able to identify new polymorphisms with this method, we carried out a de novo assembly to overcome the low coverage caused by sequence clipping in the regions with low homology. First we constructed an assembly for each variety, then we used the 'Zhongli No. 6' and 'Sanyueli' genomes to scaffold the contigs along the region. As expected, the degree of homology and collinearity varied for each assembly (Additional File 6). In general, higher homology was observed in comparisons of the assemblies of the five varieties sequenced with the CRISPR-Cas9 selective cleavage strategy, particularly in those sharing a haplotype. In particular, the assemblies of ' Angeleno' and 'Black Gold' were always the most similar, with hits ranging from 41 to 53%, depending on the reference region used for assembly.

Identification of polymorphisms in the PsMYB10.1a promoter
Given the low homology between the samples and the 'Zhongli No. 6' and 'Sanyueli' genomes, we searched for polymorphisms among the de novo assemblies only. We used as an example a 250 bp contig upstream PsMYB10.1 (previously sequenced with Sanger by Fiol et al. [26]), which included the 44 bp sequence previously used to validate the SV call, plus a 41 bp insertion and several SNPs differing between haplotypes. We successfully identified the SNPs and InDels in heterozygosis, among them the 41 bp deletion in H3 and the 44 bp insertion in the H1, H3 and H9 haplotypes. The former (position 66 to 76 in the contig, Fig. 5) was shared by ' Angeleno' (H1/H3) and 'Fortune' (H3/H6). The latter (position 200 to 243 in the alignment, Fig. 5) was present in the varieties with H1, H3 or H9, in agreement with the Sanger sequencing results of Fiol et al. [26].
Most of the SNPs detected were consistent with those previously found, however we identified a few sequencing errors. This was the case of the SNPs in positions 52, 137, 175, and 244-245 in the alignment (Fig. 5). Despite these few sequencing errors, intrinsic to the ONT sequencing technology, the results validate this methodology for the resolution and identification of variability in highly complex regions.

Discussion
In this work we confirm sequence enrichment CRISPR-Cas9 selective cleavage as a cheap and efficient high throughput strategy to identify variants and validate its use in highly complex regions in heterozygous genomes. Complex regions frequently include small and large variants, transposable elements (TE) and clusters of tandem or segmental duplicated multi-copy genes. Such genes are more likely to carry higher levels of variation [31] and be affected by presence/absence variation (PAV) than singleton genes [32], and often intervene in processes related to the evolution of agronomic traits [31,[33][34][35][36]. This makes characterizing such regions highly relevant for evolutionary studies as well as for plant breeding [37]. It is difficult, or even impossible, to completely resolve plant genome complexity by short read sequencing. Long-read approaches have improved the contiguity of genome assemblies, while special effort is currently being made to resolve particularly complicated regions of relatively small sizes in selected genotypes. In humans, several authors have used a strategy based on enrichment and sequencing using targeted cleavage of chromosomal DNA with Cas9 combined with Nanopore sequencing to resolve genomic SVs as well as mobile elements [23,[38][39][40]. In plants, this strategy has been validated to fine map short regions [24] and to characterize the transposon insertion landscape [41]. Here we extend this strategy for the characterization of much longer regions in a pool of genotypes in species without a reference genome or in regions with large variability in gene content.
As the region of interest (ROI), we selected the Prunus chromosome 3 MYB10 region, which consists of a cluster of MYB10 gene copies with a gene content variable among Prunus species.
To reflect the complexity within the ROI we analyzed the synteny within and between Prunus sections, including two recent P. salicina genomes (one for the variety 'Sanyueli' and one for 'Zhongli No. 6') that were released when our study was already at a very advanced stage. Despite belonging to the same species, the level of homology in this region of these two Japanese plum varieties was low, comparable to that observed between cultivated and wild members from the same section of apricot and peach. By contrast, the homology within each of the Prunus sections studied (apricot, sweet cherry and peach) was higher or moderately higher. Such low level of homology in the ROI within P. salicina could be attributable to the interspecific origin of the species together with a putative mis-assembly of the region in the 'Zhongli No. 6' genome, in which two LG3-MYB10 regions were assembled 2 Mb apart instead of the single one expected for Prunus. The low identity between the Prunus ROI contrasts with the high synteny described at the whole genome level within the Prunus genus [42] and confirms the high complexity of this region in Prunus in general and in Japanese plums in particular.
Previously in Fiol et al. [26], we observed six haplotypes from the segregation of PsMYB10 alleles in six F1 Japanese plum progenies and advanced breeding lines, finding high allelic variability and the triplication of the MYB10.1 gene in some haplotypes. Using the limited homology between Prunus genomes and by designing numerous PCR primer combinations, we were able to clone some of the alleles and phase the variants. This was a slow and costly strategy, and we were not able to obtain the full sequence of the haplotypes. In contrast, here we complemented the Cas9 enrichment strategy with a multiplex approach to decrease the cost of sequencing six of the observed haplotypes (named H1 to H6), which allowed to quickly obtain high-quality sequences along the MYB10 region for each of them.

Design of the crRNAs
The MYB10 proteins belong to one of the largest transcription factors (TF) groups, the R2R3-MYB class, one of the four MYB gene classes in seed plants [43]. As occurs with most multi-copy gene families, they share conserved motifs within the gene sequence. Such regions can be successfully used to direct crRNAs. We designed seven non-overlapping forward and reverse crRNAs able to provide sequences of the genes and their upstream and downstream regions. This work was started several months before the two Japanese plum genomes were released. Although their availability could have helped in designing the crRNAs, our results indicate that this strategy can be successfully applied to those species lacking a reference genome.
Due to the small length of the MYB10 sequences available and the high variability in the region, the on-target activity of the crRNAs was far from optimal (51 to 68, on a scale of 1-100). Their cleavage activity was tested in vitro with MYB10 amplicons of the variety ' Angeleno' . The results revealed that while cleavage would likely not perform well for in vivo gene editing, they were sufficient for in vitro cleavage. In addition, in vitro digestion revealed that Cas9 cannot simultaneously excise two targeted regions when these are only a few base pairs apart (about 100 bp in our design). This may be due to the Cas9 remaining bound in the cleaved DNA sequence on the 5'-side of the gRNA, allowing preferential ligation of the sequencing adapters to the available end [23,44,45].
The pool of crRNAs did not cleave the MYB10.3 amplicon tested because one of the gene copies missed both cleavage target sites, revealing the existence of a new allele, probably an unknown duplicated MYB10.3 copy. When we later visualized the genome alignments, reads were observed starting from the MYB10.3 annotated position in all the samples, indicating that the Cas9 could indeed cleave this gene.

Sequencing and alignment to reference genomes
The crRNAs were pooled to cleave the DNA of each variety in a single reaction. This strategy made it possible to enclose the ROI and cleave the MYB10 genes along the whole region, regardless of their copy number and organization, which was unknown.
Pooling the barcoded cleaved DNAs lowered the sequencing costs, although it reduced the number of reads obtained per variety, which could have risked the depth of coverage. We obtained 38.8 Mb per variety, on average, which represents 288 times the number of bases in the ROI of 'Sanyueli' , 433 in that of Zhongli-2 and 143 times in Zhongli-1. The depth and coverage obtained after alignment were lower than expected from these figures. The lowest depth values were for the alignments with the 'Zhongli No. 6' genome, which was reasonable considering that sequences were aligned to the two assembled MYB10 regions, therefore reducing the depth at each of them. However, the depth in the alignments with 'Sanyueli' was only slightly better (from 12.2 × to 29.1x, if we dismiss the extremely low values of 'Golden Japan'). The depth is highly determined by the percentage of coverage and of the on-target reads. We did not observe a correlation between the number of reads and the coverage nor with the on-target ratios. Indeed, the sequencing statistics for 'Golden Japan' were similar to those obtained for the other varieties, while all the alignment values were extremely low, indicating a poor enrichment probably caused by an inefficient Cas9 cleavage reaction.
The ratio of on-target alignments (mean of 7.13% in 'Sanyueli' after discarding 'Golden Japan' value) was higher than in other studies (4.61% for multiple cuts in Gilpatrick et al. [23]; 2.08 and 3.04% in López-Girona et al. [24]. This could be attributable to the higher number of cleavage sites designed, generating more Cas9digested DNA fragments with compatible ends for their posterior ligation with the sequencing adapters. This conclusion was also reached by McDonald et al. [40], who obtained a mean on-target value of 44% in the CRISPR Cas9-enrichment of TEs.
Since the complexity of the ROI is out of our control, higher depths would be more easily obtained by increasing the number of cleavage sites. As the crRNAs are pooled in a single cleavage reaction, the cost of the method is only increased by the cost of the crRNAs employed, which is a small proportion of the total cost. In addition, we consider that by reducing the number of pooled varieties, and therefore increasing the number of reads, we would not have obtained much higher coverage or depth. The complexity of the region, its size, and the number of cutting sites need to be considered in deciding the number of pooled varieties.

Variant identification in the alignments to a reference genome
The strategy described here can be successfully used to rapidly extract useful information for marker-assisted selection. The generation of long reads offers a higher mapping certainty [46], which is extremely useful for regions with segmental duplications to pinpoint sequence variants with high confidence levels.
We identified SNPs and SVs along the ROI involved in the regulation of fruit skin color. For SNP calling we used Longshot software [47], which not only considered the error tendency of the ONT long-reads to avoid false positives but has also been reported to outperform the SNP calling method using only short-read sequences in regions with duplications. A total of 129 SNPs were shared by the three cultivars with red-to-purple skin and absent in the two with yellow hues; since their phases are known, they could be selected, after validation, as alternatives to the haplotype (PCR) marker provided in Fiol et al. [26].
For SV calling we used Sniffles software [29] complemented with NanoVar [30] to overcome the low depth of 'Golden Japan' . The results indicate that the tools complemented each other well in overcoming their limitations, and that SVs could still be called even in low depth. In addition, we were able to phase the variants, finding the 8 bp insertion in exon 1 of the PsMYB10.1a gene in the H9 of 'Golden Japan' in phase with the 44 bp insertion associated with red skin. This 8 bp insertion produces an early STOP codon, which explains the yellow skin of this cultivar despite it carrying the 44 bp insertion (Additional File 7).

De novo assemblies
We have shown that reads can be aligned to a reference genome or assembled de novo in the case of its absence or of bad homology. Here, to recover the positions lost due to the lack of homology with the reference genomes, we used the pipeline for read correction and subsequent de novo assembly provided in López-Girona et al. [24]. One of the main limitations in our study was that reads produced from two very distant adjacent genes had little chance to overlap. This was the case in the Zhongli-1 region, where two adjacent genes 124 kb apart left 55 kb of the region uncovered. The other limitation for obtaining a whole-region assembly was that reads could not overlap the cleavage positions because these were produced in a single reaction. Therefore, despite reducing Table 3 Strengths, weakness, improvements and new opportunities of the CRISPR-Cas9 sequencing enrichment strategy described in this study

New opportunities
The design of crRNAs does not necessarily require a reference genome Pooling the crRNAs and barcoded DNAs results in a simple and cost-effective method Polymorphisms can be extracted and phased with high efficiency from genome alignments and/or de novo sequences The method is computationally inexpensive The search for polymorphisms in the de novo sequences might require the manual isolation and comparison of contigs ONT technology is prone to sequence errors in homopolymer regions [48] The Cas9 digestion with a single crRNA pool cannot produce overlapping fragments Digesting the DNAs with sub-pools of crRNA might improve the region assembly, at the expenses of simplicity and cost-effectiveness The use of samples homozygous for the target region might increase the efficiency of polymorphism discovery and sequence scaffolding Identification of methylation variants between the pooled samples in the target region [23,25] Enrichment of genes scattered in more than one region of interest, such as the enrichment of specific gene or transposon families [41] the cost, cleaving in a single reaction reduced the accuracy of the assembly, meaning that there should be a compromise between cost and coverage of the assembly when designing the experiment.

Variant identification in the de novo assemblies
The pipeline used to identify variants between de novo assemblies succeeded in the identification and assignment of polymorphisms into haplotypes in three of the samples rather than compiling them, improving the results obtained when using a reference genome (the 44 bp insertion was detected in 'Fortune'). However, in cases of high similarity between haplotypes or low depth coverage, the use of a reference genome performs better. Therefore, where a reference genome is available, the alignment to it and de novo assemblies should be combined to ensure all polymorphisms are identified.

Strengths, weaknesses, improvements, and opportunities
We present here an economic and powerful strategy to extract sequences and polymorphisms from highly complex regions which may contain duplicated and presence/ absence variants (PAVs). The main strengths and weakness of this strategy is summarized in Table 3.

Conclusions
The CRISPR-Cas9 enrichment strategy enabled long-read sequences from a highly polymorphic and duplicated region to be produced and polymorphisms related to fruit skin color to be deduced. The methodology was simplified through using a pool of crRNAs in duplicated conserved regions and the sequencing cost was optimized by multiplex sequencing while keeping the protocol computationally inexpensive. The tool could be further improved by considering methylation patterns and also adapted for the full assembly of complex regions, being also appropriate for those crops with a lack of a reference genome or with one unrepresentative for the genomic region of interest.

Plant material and nucleic acid isolation
Young leaves from five commercial Japanese plum varieties (Table 4) were collected and kept at − 80 ºC. Prior to DNA extraction, nuclei were separated as described in Naim et al. [49] and detailed online [50], but the vegetal material was homogenized by grinding in liquid nitrogen using a pestle and mortar. DNA was extracted from the isolated nuclei using the Doyle CTAB method [51], introducing an RNAse treatment before the chloroform centrifugation step.
The quality of the extracted DNA was evaluated using Nanodrop and low voltage 0.8% agarose TAE gels, and quantified using a Qubit fluorometer.

Design of guide RNAs targeting conserved sites of MYB10 genes
The Alt-R CRISPR-Cas9 System (IDT ® ) was selected to design and assemble the CRISPR RNAs (crRNA) and the trans-activating crRNAs (tracrRNA) into functional guide RNAs (gRNAs).
The crRNA sequences were designed in the desired positions with the Alt-R ® Custom Cas9 crRNA Design Tool, which listed the potential crRNAs by their calculated on-target activity and sequence position. Those with the highest on-target capacity were BLAST to the Peach genome v2.0.a1 [52] and the Sweet cherry genome v1.0 [53] to discard gRNAs with off-target risk regions. SNP variability identified in other genomes was also considered in the crRNA design. Those gRNAs directed at the most highly conserved LG3-MYB10 sequences were finally selected.

Assembly of the Cas9 RNA-Ribonucleoprotein complex (RNP)
The designed crRNAs and the tracrRNA were resuspended to 100 µM in Nuclease-Free Duplex Buffer (IDT). To form the gRNA duplex, all the crRNAs were pooled in an equimolar mix and combined with the same volume of tracrRNA to a final concentration of 10 µM. The mixture was denaturalized for five minutes at 95 ºC and allowed to cool at room temperature. To assemble the Cas9-RNP complex for a single cleavage reaction, 1 µL of the gRNA duplex and 0.08 µL of Alt-R ® S.p. HiFi Cas9 Nuclease V3 (IDT) were mixed with 1X of CutSmart Buffer (NEB) and nuclease-free water to a total of 10 µL then incubated for 30 min at room temperature and kept on ice for immediate use.

Cleavage assay of the designed gRNAs with PCR amplicons
The assembled Cas9-RNPs activity was tested using PCR amplicons of PsMYB10.1a, PsMYB10.2 and PsMYB10.3 full gene sequences from ' Angeleno' . The primer sequences and their annealing temperature (Ta) are listed in Additional File 8. Each PCR reaction contained 1.5 mM MgCl 2 and 1X NH 4 buffers, 1U of BioTaq polymerase (Bioline), 0.2 mM dNTPs, 0.2 µM of each primer, 40 ng of DNA from ' Angeleno' and MilliQ water to a total of 25 µL. Thermocycler conditions were 94 ºC for 1 min, 35 cycles of 94 ºC for 30 s, the Ta described for each primer pair for 20 s and 72 ºC for 2 min, followed by a final extension step of 5 min at 72 ºC. Reactions were purified using ExoSap-IT (Thermo Fisher) and then 3 µL of each amplicon was diluted with 32 µL of nuclease-free water before adding 10 µL of assembled Cas9-RNP. Cleavage reactions were incubated at 37 ºC for 20 min and five minutes at 72 ºC, then loaded in a 2% agarose TAE gel to visualize the cleaved bands compared to the uncut amplicons.

Comparison of the MYB10 region in several Prunus whole-genome assemblies
A total of fifteen genomes were added to the analysis: two Japanese plums, eight apricots, two sweet cherries, two peaches and a wild peach relative. For all the genomes, the MYB10 region was considered from the BLAST [54] regions of crRNA-f1 and crRNA-f2 sequences. All genomes and the regions considered in this study are listed in Additional File 2. The homology between the regions was calculated and visualized pairwise using the DECIPHER v2.0 [55] package for R version 4.0.3 [56]. The MYB10 genes in the two Japanese plum genomes were identified by BLAST using the PsMYB10 sequences from Fiol et al. [26].

CRISPR-Cas9 enrichment, library preparation and sequencing
The high-quality nuclei DNA from the five selected commercial plum varieties and the pooled crRNAs (100 µM) were sent to CNAG (Centre Nacional d' Anàlisi Genòmica, Barcelona) for CRISPR-Cas9 enrichment targeted sequencing using Oxford Nanopore Technology (ONT). For each sample, 1000 ng of DNA diluted in 27 µL of nuclease-free water was dephosphorylated by the addition of 3 µL of 10X CutSmart Buffer and 3 µL of Quick Calf Intestinal Phosphatase (NEB) and then incubated at 37 ºC for 10 min, 80 ºC for two minutes and 20 ºC for two minutes. To each reaction, 10 µL of previously assembled Cas9-RNP complex was added together with 1 µL of dATP (10 mM) and 1 µL of Taq polymerase (NEB) and incubated at 37 ºC for 20 min and at 72 ºC for five minutes for A-tailing. Samples were tagged using barcodes NB01 to NB05 of the Native Barcoding Expansion 1-12 PCR-free kit (EXP-NBD104, ONT) and purified using Agencourt AMPure XP beads (Beckman Coulter). The barcoded samples were quantified using a Qubit fluorometer and pooled equimolarly to reach 700 ng in 65 µL. The Cas9 Sequencing Kit (SQK-CS9109, ONT) was used to ligate the AMX adapters before purification of the fragments with AMPure XP Beads. The beads were washed twice with Long Fragment Buffer (LFB) and then resuspended in Elution Buffer (EB) for 30 min at room temperature. The sequencing library was prepared by mixing 12 µL of the DNA library, 37.5 µL of Sequencing Buffer (SQB) and 25.5 µL of Loading Beads (LB) (SQK-CS9109, ONT). The sample of pooled fragments was sequenced in a MinION (v9.4.1) flowcell on a GridION mK1 device operated by MinKNOW version 3.6.5 software.

Reads correction, Flye assembly and reference-guided scaffolding
Canu v 2.1.1 [63] was used to correct the trimmed reads and perform an initial de novo assembly. Nanopolish v0.11.1 [64] was used to improve the assembly results. For each sequenced cultivar, their Canu corrected reads and polished contigs were used as input for Flye assembler v2.8.3 [65] enabling the option to keep haplotypes. The resulting contigs were genome-guided scaffolded with the RagTag software [66], using separately the three MYB10 regions from the two P. salicina available genomes as template reference sequences. DECIPHER v2.0 [55] was used to display the homology between the references and reference-guided sequences in a pairwise fashion and obtain their similarity values.

Identification of SNPs and InDels in the PsMYB10.1a promoter
A region of 250 bp in the PsMYB10.1a promoter containing several SNPs and large InDels was used for comparison. Sanger sequences of this region obtained for different haplotypes [26] were used in a BLAST analysis to recover the haplotypes in the de novo (Flye) contigs. In the case of the H4 sequence, which could not be recovered, reads were isolated from their alignment to the 'Zhongli No. 6' genome PsMYB10.1a promoter region and assembled manually. No Sanger sequences were available for H5 and H6, and their contigs on 'TC Sun' and 'Fortune' were isolated as alternative to H4 and H3 sequences. The selected contigs from each cultivar were aligned and compared using Sequencher 5.0 (Gene Codes Corporation, Ann Arbor, MI, USA).